I am reading multiple html files into a dataframe in Spark, and I am converting elements of the html into columns of the dataframe with a custom udf:

val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath)
  .toDF("filepath", "filecontent")
  .withColumn("biz_name", parseDocValue(".biz-page-title")('filecontent))
  .withColumn("biz_website", parseDocValue(".biz-website a")('filecontent))
  ...

def parseDocValue(cssSelectorQuery: String) =
  udf((html: String) => Jsoup.parse(html).select(cssSelectorQuery).text())

This works perfectly, but every withColumn call parses the html string again, which is redundant.

Is there a way (without using lookup tables or similar) to create one parsed document (Jsoup.parse(html)) per row from "filecontent" and make it available to all the withColumn calls on the dataframe?

Or shouldn't I bother trying to use DataFrames and just use RDDs instead?
Answer 0 (score: 0)
I would probably rewrite it as follows, doing the parsing and all the selects in one pass and putting the results in a temporary column:
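A minimal sketch of that approach, assuming a parseDocValues variant that takes all the selector queries at once and returns their results as an array (the names and the temp column are illustrative, and spark.implicits._ is assumed to be in scope as in the question):

import org.apache.spark.sql.functions.udf
import org.jsoup.Jsoup

// Parse the html once per row and run every selector against the same document.
def parseDocValues(cssSelectorQueries: Seq[String]) =
  udf((html: String) => {
    val doc = Jsoup.parse(html) // single parse per row
    cssSelectorQueries.map(q => doc.select(q).text())
  })

val selectors = Seq(".biz-page-title", ".biz-website a")

val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath)
  .toDF("filepath", "filecontent")
  .withColumn("temp", parseDocValues(selectors)('filecontent))
  .withColumn("biz_name", $"temp".getItem(0))
  .withColumn("biz_website", $"temp".getItem(1))
  .drop("temp")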
Answer 1 (score: 0)
So the final answer was actually quite simple: just map over the rows and create the objects there.
// Pick a value out of the already-parsed document: an attribute if one is
// requested, otherwise the element text. Empty results become None.
def docValue(cssSelectorQuery: String, attr: Option[String] = None)(implicit document: Document): Option[String] = {
  val domObject = document.select(cssSelectorQuery)
  val domValue = attr match {
    case Some(a) => domObject.attr(a)
    case None    => domObject.text()
  }
  domValue match {
    case x if x == null || x.isEmpty => None
    case y                           => Some(y)
  }
}
val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath, minPartitions = 265)
  .map {
    case (filepath, filecontent) => {
      // Parse each file once; every docValue call below reuses the implicit document.
      implicit val document = Jsoup.parse(filecontent)
      val customDataJson = docJson(filecontent, customJsonRegex)

      DataEntry(
        biz_name = docValue(".biz-page-title"),
        biz_website = docValue(".biz-website a"),
        url = docValue("meta[property=og:url]", attr = Some("content")),
        ...
        filename = Some(fileName(filepath)),
        fileTimestamp = Some(fileTimestamp(filepath))
      )
    }
  }
  .toDS()
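DataEntry, docJson, fileName and fileTimestamp are not defined in the snippet; a rough sketch of what the case class and the two path helpers could look like (field types and bodies here are assumptions) is:

// Hypothetical shapes for the helpers used in the snippet above.
case class DataEntry(
  biz_name: Option[String],
  biz_website: Option[String],
  url: Option[String],
  // ... the remaining extracted columns ...
  filename: Option[String],
  fileTimestamp: Option[String]
)

// e.g. last path segment as the file name
def fileName(filepath: String): String =
  filepath.split('/').last

// e.g. a placeholder timestamp recorded when the file is processed
def fileTimestamp(filepath: String): String =
  java.time.Instant.now().toString

For .toDS() to work, the case class has to be defined outside the method that builds the dataset and spark.implicits._ has to be imported.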